A powerful open-source medical foundation model that unifies visual grounding, clinical reasoning, and language understanding across diverse medical imaging modalities.
Mohamed bin Zayed University of Artificial Intelligence
Multimodal large language models (MLLMs) have rapidly advanced, yet their adoption in medicine remains limited by gaps in domain coverage, modality alignment, and grounded reasoning. In this work, we introduce MedMO, a medical foundation model built upon a generalized MLLM architecture and trained exclusively on large-scale, domain-specific data.
MedMO follows a multi-stage training recipe: (i) cross-modal pretraining to align heterogeneous visual encoders with a medical language backbone; (ii) instruction tuning on multi-task supervision that spans captioning, VQA, report generation, retrieval, and grounded disease localization with bounding boxes; and (iii) reinforcement learning with verifiable rewards that combine factuality checks with a box-level GIoU reward to strengthen spatial grounding and step-by-step reasoning in complex clinical scenarios.
MedMO consistently outperforms strong open-source medical MLLMs across multiple modalities and tasks, achieving state-of-the-art performance in medical VQA, report generation, and diagnostic reasoning. Evaluations across radiology, ophthalmology, pathology, and emergency care confirm MedMO's broad cross-modality generalization and reliable spatial reasoning.
MedMO achieves state-of-the-art results across diverse medical imaging tasks
Addressing critical limitations in existing medical MLLMs
Most existing models rely on distilled data from proprietary models, which often lack accurate domain grounding for fine-grained clinical reasoning.
Distillation pipelines without structured supervision amplify hallucination risks and inconsistencies in medical outputs.
Current models focus on individual tasks or narrow modality subsets rather than achieving unified, cross-modal generalization.
Progressive post-training for comprehensive medical image understanding
Align heterogeneous visual encoders with a medical language backbone using the DeepStack fusion mechanism.
Instruction tuning spans captioning, VQA, report generation, retrieval, and grounded disease localization with bounding boxes.
A novel bounding-box GIoU reward, combined with factuality checks, strengthens spatial grounding (a minimal reward sketch follows below).
Built upon Qwen3-VL with a modular design enabling future expansion across additional modalities.
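The exact reward formulation is not spelled out here, so the following is only a minimal sketch of how a box-level GIoU term could be combined with a binary factuality check during RL with verifiable rewards. The function names (`giou`, `grounding_reward`), the box format, and the gating scheme are illustrative assumptions, not MedMO's actual implementation.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), axis-aligned


def giou(pred: Box, gt: Box) -> float:
    """Generalized IoU in [-1, 1]: IoU minus the fraction of the smallest
    enclosing box that is not covered by the union of the two boxes."""
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt

    # Intersection area
    ix1, iy1 = max(px1, gx1), max(py1, gy1)
    ix2, iy2 = min(px2, gx2), min(py2, gy2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    # Union area
    area_p = max(0.0, px2 - px1) * max(0.0, py2 - py1)
    area_g = max(0.0, gx2 - gx1) * max(0.0, gy2 - gy1)
    union = area_p + area_g - inter
    iou = inter / union if union > 0 else 0.0

    # Smallest axis-aligned box enclosing both boxes
    enclose = (max(px2, gx2) - min(px1, gx1)) * (max(py2, gy2) - min(py1, gy1))
    return iou - (enclose - union) / enclose if enclose > 0 else iou


def grounding_reward(pred: Box, gt: Box, factual_ok: bool) -> float:
    """Illustrative verifiable reward: rescale GIoU to [0, 1] and gate it on a
    binary factuality check of the accompanying text (an assumed combination)."""
    spatial = (giou(pred, gt) + 1.0) / 2.0
    return spatial if factual_ok else 0.0


# Example: a prediction partially overlapping the ground-truth lesion box
print(grounding_reward((40, 40, 110, 100), (50, 50, 120, 110), factual_ok=True))
```

Rescaling GIoU from [-1, 1] to [0, 1] keeps the spatial term non-negative, which composes cleanly with other verifiable-reward terms; the actual pipeline may weight or combine the factuality and grounding signals differently.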
State-of-the-art performance across medical VQA, Text QA, and Grounding tasks
MedMO-8B achieves the best overall balance among open-source models, outperforming both Lingshu-7B and Fleming-VL-8B with the strongest Text-QA average (+14.6 points over Fleming-VL-8B) while keeping its VQA average within 1.8 points of the best open-source result.
| Model | MMMU-Med | VQA-RAD | SLAKE | PathVQA | PMC-VQA | OmniMedVQA | MedXQA | Avg. |
|---|---|---|---|---|---|---|---|---|
| GPT-4.1 | 75.2 | 65.0 | 72.2 | 55.5 | 55.2 | 75.5 | 45.2 | 63.4 |
| Claude Sonnet 4 | 74.6 | 67.6 | 70.6 | 54.2 | 54.4 | 65.5 | 43.3 | 61.5 |
| Gemini-2.5-Flash | 76.9 | 68.5 | 75.8 | 55.4 | 55.4 | 71.0 | 52.8 | 65.1 |
| Fleming-VL-8B | 63.3 | 66.1 | 86.5 | 62.9 | 64.3 | 86.7 | 21.6 | 64.4 |
| Lingshu-7B | 54.0 | 67.9 | 83.1 | 61.9 | 56.3 | 82.9 | 26.7 | 61.8 |
| Qwen3VL-8B (Baseline) | 61.4 | 64.1 | 47.3 | 14.6 | 52.3 | 77.2 | 24.8 | 48.8 |
| MedMO-4B (Ours) | 54.6 | 50.9 | 41.0 | 62.4 | 50.6 | 79.7 | 24.8 | 52.0↑+3.2 |
| MedMO-8B (Ours) | 64.6 | 64.7 | 81.6 | 56.3 | 59.4 | 84.8 | 26.9 | 62.6↑+13.8 |
| Model | MMLU-Med | PubMedQA | MedMCQA | MedQA | Medbullets | MedXQA | SGPQA | Avg. |
|---|---|---|---|---|---|---|---|---|
| GPT-4.1 | 89.6 | 75.6 | 77.7 | 89.1 | 77.0 | 30.9 | 49.9 | 70.0 |
| Claude Sonnet 4 | 91.3 | 78.6 | 79.3 | 92.1 | 80.2 | 33.6 | 56.3 | 73.1 |
| Gemini-2.5-Flash | 84.2 | 73.8 | 73.6 | 91.2 | 77.6 | 35.6 | 53.3 | 69.9 |
| Fleming-VL-8B | 71.8 | 74.0 | 51.8 | 53.7 | 40.5 | 12.1 | 24.9 | 46.9 |
| Lingshu-7B | 74.5 | 76.6 | 55.9 | 63.3 | 56.2 | 16.5 | 26.3 | 52.8 |
| Qwen3VL-8B (Baseline) | 79.3 | 70.4 | 60.0 | 66.1 | 56.1 | 15.1 | 34.7 | 54.5 |
| MedMO-4B (Ours) | 75.7 | 78.0 | 58.0 | 78.5 | 57.5 | 16.4 | 29.4 | 56.2↑+1.7 |
| MedMO-8B (Ours) | 82.2 | 76.8 | 65.0 | 83.8 | 65.2 | 20.4 | 37.2 | 61.5↑+7.0 |
Report generation is evaluated with semantic (ROUGE-L, CIDEr) and model-based (RaTE, Semb) metrics; a minimal ROUGE-L scoring sketch follows the table.
| Model | MIMIC-CXR |  |  |  | CheXpert Plus |  |  |  | IU-Xray |  |  |  | Med-Trinity |  |  |  |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|  | R-L | CIDEr | RaTE | Semb | R-L | CIDEr | RaTE | Semb | R-L | CIDEr | RaTE | Semb | R-L | CIDEr | RaTE | Semb |
| GPT-4.1 | 9.0 | 82.8 | 51.3 | 23.9 | 24.5 | 78.8 | 45.5 | 23.2 | 30.2 | 124.6 | 51.3 | 47.5 | – | – | – | – |
| Gemini-2.5-Flash | 25.4 | 80.7 | 50.3 | 29.7 | 23.6 | 72.2 | 44.3 | 27.4 | 33.5 | 129.3 | 55.6 | 50.9 | – | – | – | – |
| Lingshu-7B | 30.8 | 109.4 | 52.1 | 30.0 | 26.5 | 79.0 | 45.4 | 26.8 | 41.2 | 180.7 | 57.6 | 48.4 | 16.0 | 74.5 | 44.4 | 24.0 |
| Fleming-VL-8B | 35.7 | 132.5 | 56.7 | 33.6 | 26.1 | 82.2 | 47.1 | 40.1 | 44.9 | 198.6 | 66.0 | 51.3 | 13.1 | 35.8 | 41.9 | 18.1 |
| Qwen3VL-8B (Baseline) | 25.1 | 77.9 | 50.3 | 33.4 | 21.9 | 67.4 | 44.4 | 37.9 | 25.0 | 91.4 | 52.5 | 42.9 | 20.2 | 69.9 | 45.9 | 33.6 |
| MedMO-4B (Ours) | 26.0 | 92.6 | 49.8 | 31.6 | 15.1 | 62.3 | 36.6 | 34.2 | 26.6 | 94.0 | 42.1 | 41.3 | 22.5 | 152.6 | 47.8 | 34.3 |
| MedMO-8B (Ours) | 31.7 | 140.0 | 57.1 | 50.0 | 23.6 | 87.5 | 47.3 | 42.2 | 31.1 | 169.7 | 45.3 | 41.3 | 37.0 | 270.4 | 53.0 | 39.2 |
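As a concrete reference for the semantic metrics above, here is a minimal sketch of scoring a generated report against a reference with ROUGE-L using the `rouge_score` package; the report strings are hypothetical, and the model-based metrics (RaTE, Semb) require their own scoring models, so they are not shown.

```python
# Minimal ROUGE-L scoring sketch (hypothetical report strings).
from rouge_score import rouge_scorer

reference = "Mild cardiomegaly. No focal consolidation, pleural effusion, or pneumothorax."
generated = "Heart size is mildly enlarged. No consolidation or pleural effusion."

# ROUGE-L measures longest-common-subsequence overlap; CIDEr is computed at the
# corpus level (TF-IDF weighted n-grams) and needs the full evaluation set.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
scores = scorer.score(reference, generated)  # target first, prediction second
print(f"ROUGE-L F1: {scores['rougeL'].fmeasure:.3f}")
```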
| Model | NIH Chest | DeepLesion | Bacteria | MedSG (multi-view) | MedSG (tracking) | Avg. |
|---|---|---|---|---|---|---|
| InternVL3-8B | 10.1 | 0.0 | 0.7 | 6.3 | 13.0 | 5.6 |
| Fleming-VL-8B | 0.0 | 0.0 | 8.3 | 42.0 | 36.7 | 17.2 |
| Lingshu-7B | 5.3 | 0.7 | 0.0 | 28.3 | 38.7 | 13.9 |
| Qwen3VL-8B | 16.4 | 0.0 | 9.16 | 8.4 | 17.8 | 13.8 |
| MedMO-8B (Ours) | 8.83 | 38.5 | 54.6 | 75.8 | 77.2 | 54.2↑+40.4 |
A powerful open-source, post-trained vision-language model (VLM) designed for comprehensive medical image understanding and grounding, available in 4B and 8B variants.
Curated 26M+ multimodal medical samples from 45 datasets with a multi-stage post-training pipeline that progressively enhances cross-modal alignment.
Constructed a dedicated Cell dataset from open-source microscopy images with varying sizes, shapes, and densities for evaluating VLM detection capabilities.
Extensive experiments across data and methodology dimensions, providing an open benchmark for future multimodal medical LLM research.
MedMO demonstrates superior diagnostic accuracy and clinical reasoning